library(MASS)
require(jpeg)
## Loading required package: jpeg
require(softImpute)
## Loading required package: softImpute
## Loading required package: Matrix
## Loaded softImpute 1.4
library(pls)
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings
library(ISLR)

Principal Component Analysis

PCA creates a new set of features (a new coordinate system) which are each a linear combination of the original features. The goal here is to (i) capture most of the “structure” in the data with a relatively low-dimensional linear subspace and (ii) potentially use the new low-D feature space as a remedy for the curse of dimensionality.

Geometric Interpretation

In which direction(s) does the data vary the most? Can we create a new coordinate system that emphasizes those directions instead of the original measured features?

set.seed(123)
x <- mvrnorm(n=100, mu=c(3,3), Sigma=matrix(c(.2,.1,.1,.2), byrow=TRUE,nrow=2) ) # Sigma must be symmetric
plot(x, xlim=c(0,5),ylim=c(0,5))
abline(0,1,col='red')

We can also think of the first principal component as the linear subspace that is “closest” to the data. Quite similar to regression in that sense.
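The contrast with regression can be made concrete: least squares minimizes vertical distances to the line, while the first principal component minimizes perpendicular distances, so the two fitted lines generally differ. A small simulated sketch (variable names are ours):

```r
library(MASS)
set.seed(1)
x <- mvrnorm(n = 200, mu = c(0, 0), Sigma = matrix(c(1, .8, .8, 1), nrow = 2))

# Regression of x2 on x1 minimizes vertical distances
b_reg <- unname(coef(lm(x[, 2] ~ x[, 1]))[2])

# PC1 minimizes perpendicular distances to the fitted line
v <- prcomp(x)$rotation[, 1]
b_pca <- v[2] / v[1]

c(regression = b_reg, pca = b_pca)  # the PC1 slope is steeper here
```

With equal variances and positive correlation, the PC1 line lies along the main diagonal while the regression line is “shrunk” toward the horizontal.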

Terminology: * Loadings * Scores * Proportion of Variance explained * Scree plot * biplot (this is an R function)
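As a quick orientation, here is where each of these terms shows up in R’s prcomp output, sketched on simulated data:

```r
set.seed(123)
x <- matrix(rnorm(100 * 3), ncol = 3)
pc <- prcomp(x)

pc$rotation     # loadings: each column defines one principal component
head(pc$x)      # scores: the data expressed in the new coordinates
summary(pc)     # proportion of variance explained by each component
screeplot(pc)   # scree plot of the component variances
biplot(pc)      # scores and loadings in one picture
```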

Algebraic Interpretation

Diagonalization of Covariance Matrix

From the mean-centered data matrix \(X\), form the covariance matrix \(C = \frac{1}{n-1} X^{\textrm{T}}X\). The eigenvectors of the covariance matrix are the principal components because they point in the directions of the strongest variance of the data.

  1. Mean-center the data
  2. Compute the covariance matrix \(\frac{1}{n-1} X^{\textrm{T}}X\)
  3. Compute the eigenvectors of the covariance matrix
x <- mvrnorm(n=100, mu=c(3,3), Sigma=matrix(c(.2,.1,.1,.2), byrow=TRUE,nrow=2) )
x_centered = x - c(3,3)  # subtract the true mean (in practice, subtract the column means)
covariance_mat <- (1/99) * t(x_centered) %*% x_centered
eig <- eigen(covariance_mat)
eig$vectors
##            [,1]       [,2]
## [1,] -0.7417674  0.6706572
## [2,] -0.6706572 -0.7417674
plot(x_centered)
abline(0, eig$vectors[2,1]/eig$vectors[1,1])  # slope of the first eigenvector: rise over run

PCA can also be understood as the singular value decomposition of the data matrix X. Don’t worry if you don’t recall what SVD is, just understand it to be a generalization of eigen-decomposition to non-square matrices.
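The connection is easy to check numerically: the right singular vectors of the mean-centered data matrix are the loadings, and the singular values divided by \(\sqrt{n-1}\) are the component standard deviations. A small sketch on simulated data:

```r
set.seed(1)
x <- matrix(rnorm(100 * 4), ncol = 4)
xc <- scale(x, center = TRUE, scale = FALSE)   # mean-center the columns

s <- svd(xc)
pc <- prcomp(x)

# Right singular vectors = principal component loadings (up to sign)
max(abs(abs(s$v) - abs(pc$rotation)))          # essentially zero
# Singular values relate to the component sds: d_i / sqrt(n-1) = sdev_i
max(abs(s$d / sqrt(nrow(x) - 1) - pc$sdev))    # essentially zero
```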

Code

Let’s return to the R functions and gain some more intuition about principal components, loadings, and scores.

Load iris data and make pairs plot.

 data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris0 = iris[,-5]
pairs(iris0,col = rep(2:4, each = 50), lwd = 3)

Compute principal components and make biplot.

iris.pca = prcomp(iris0,scale=F)
biplot(iris.pca, cex = .6)

Check what has been computed.

names(iris.pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
iris.pca$sdev
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
iris.pca$rotation
##                      PC1         PC2         PC3        PC4
## Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
## Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
## Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574

The rotation matrix is orthogonal. It satisfies \(R^{\textrm{T}}R = RR^{\textrm{T}} = I\), the identity matrix.

R <- iris.pca$rotation
t(R)%*%R
##               PC1           PC2           PC3           PC4
## PC1  1.000000e+00 -1.630640e-16 -8.326673e-17  5.551115e-17
## PC2 -1.630640e-16  1.000000e+00  6.938894e-17  6.245005e-17
## PC3 -8.326673e-17  6.938894e-17  1.000000e+00 -5.551115e-17
## PC4  5.551115e-17  6.245005e-17 -5.551115e-17  1.000000e+00
R%*%t(R)
##              Sepal.Length   Sepal.Width Petal.Length   Petal.Width
## Sepal.Length 1.000000e+00  6.245005e-16 2.498002e-16  2.775558e-16
## Sepal.Width  6.245005e-16  1.000000e+00 0.000000e+00 -2.220446e-16
## Petal.Length 2.498002e-16  0.000000e+00 1.000000e+00  1.110223e-16
## Petal.Width  2.775558e-16 -2.220446e-16 1.110223e-16  1.000000e+00

The loading of the first principal component is largest for petal length. That is related to the fact that this variable has the largest variance (see the diagonal entries of the covariance matrix below).

var(iris0)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
## Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
## Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
## Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063

Here are the other variables that are computed by prcomp().

iris.pca$center
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
iris.pca$scale
## [1] FALSE
head(iris.pca$x)
##            PC1        PC2         PC3          PC4
## [1,] -2.684126 -0.3193972  0.02791483  0.002262437
## [2,] -2.714142  0.1770012  0.21046427  0.099026550
## [3,] -2.888991  0.1449494 -0.01790026  0.019968390
## [4,] -2.745343  0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858

The matrix in iris.pca$x is obtained by subtracting the column means in iris.pca$center from each column of the data and multiplying the result from the right with the matrix in iris.pca$rotation. Check this for the first row:

as.matrix(iris0[1,]-iris.pca$center)%*%as.matrix(iris.pca$rotation) - iris.pca$x[1,]
##   PC1 PC2 PC3 PC4
## 1   0   0   0   0

The first two entries of each row of iris.pca$x are plotted in the biplot. The plot symbol is simply the row number. We checked this by looking at the coordinates for a few rows.

iris.pca$x[42,1:2]
##        PC1        PC2 
## -2.8493687  0.9409606
iris.pca$x[16,1:2]
##       PC1       PC2 
## -2.386039 -1.338062

Make a plot of the first and second components of all observations, colored by species.

plot(iris.pca$x[,1],iris.pca$x[,2],col = rep(2:4, each = 50), lwd = 3)

We then repeated all this with scaling turned on, i.e.

iris.pca1 = prcomp(iris0,scale=T)
plot(iris.pca1$x[,1],iris.pca1$x[,2],col = rep(2:4, each = 50), lwd = 3)

Image Example

#define a couple useful functions
set.seed(123)
imagegray = function(A){image(A,col = gray((0:255)/256))} # plot grayscale image
mytrunc = function(x, top = 1, bot = 0){pmax(pmin(x,top),bot)} # clamp values to [bot, top]

Make sure that the JPEG files melencolia.JPG and melencoliaDetail.JPG are in the same directory as this .Rmd file.

Read a JPEG file, turn it into a grayscale file, rearrange rows so it can be plotted, and plot the image.

A = readJPEG("melencolia.JPG")
dim(A)
## [1] 1011  800    3
A.gray = .2989*A[,,1] + .5870*A[,,2] + .1140*A[,,3]
A.gray = t(A.gray[seq(dim(A.gray)[1],1,by = -1),])
A.gray[1:5,1:5]
##      [,1]   [,2]      [,3]      [,4]      [,5]
## [1,]    0 0.9999 0.9175553 0.9256757 0.9256757
## [2,]    0 0.9999 0.9175553 0.9256757 0.9256757
## [3,]    0 0.9999 0.9175553 0.9256757 0.9256757
## [4,]    0 0.9999 0.9175553 0.9256757 0.9256757
## [5,]    0 0.9999 0.9175553 0.9256757 0.9256757
imagegray(A.gray)

More on this engraving from 1514 may be found at https://en.wikipedia.org/wiki/Melencolia_I.

The grayscale matrix has all values between 0 and 1. Entries closer to 0 are plotted as darker pixels.

Compute the singular value decomposition of the matrix and plot the logarithms of the singular values.

A.svd = svd(A.gray)
names(A.svd)
## [1] "d" "u" "v"
plot(log10(A.svd$d), main = "Log singular values")
grid(col = 2)
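A common way to read such a plot is to ask what fraction of the squared Frobenius norm (the “energy”) the first \(k\) singular values capture. A sketch on a synthetic low-rank-plus-noise matrix, so this chunk runs without the image file; the same computation applies to A.svd$d:

```r
set.seed(1)
# Synthetic stand-in for A.gray: low-rank structure plus a little noise
M <- matrix(rnorm(200 * 5), 200, 5) %*% matrix(rnorm(5 * 150), 5, 150) +
  0.1 * matrix(rnorm(200 * 150), 200, 150)
d <- svd(M)$d

energy <- cumsum(d^2) / sum(d^2)  # fraction of squared Frobenius norm captured
which(energy >= 0.99)[1]          # smallest rank that captures 99% of the energy
```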

Approximate the grayscale matrix with a rank 30 approximation, obtained from the SVD.

k = 30
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(A.approx+1/2)

The function imagegray rescales entries in the matrix so that they lie between 0 and 1 and then plots a grayscale image. As it turns out, the approximating matrix has many entries \(>1\). The overall effect is that contrast is reduced. We can restore contrast by truncating the approximating matrix and then plotting it. We can also plot the difference between the correct and the approximating matrix. This shows that the approximation leaves out a lot of detail, in particular small-scale features and edges.

imagegray(mytrunc(A.approx))

imagegray(A.gray-A.approx)

Redo this with a rank 100 approximation. A lot more detail is now visible.

k = 100
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(mytrunc(A.approx))
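One reason low-rank approximations are useful is compression: storing the rank-\(k\) factors takes \(k(m+n+1)\) numbers instead of \(mn\). For the dimensions of A.gray above:

```r
# Storage cost of a rank-k factorization versus the full matrix:
# u is m-by-k, d has k entries, v is n-by-k, so k*(m + n + 1) numbers in all.
m <- 1011; n <- 800             # dimensions of A.gray above
k <- c(30, 100)

full    <- m * n                # 808800 numbers in the full matrix
lowrank <- k * (m + n + 1)
round(lowrank / full, 3)        # fraction of full storage: 0.067 0.224
```

So even the rank 100 approximation uses less than a quarter of the original storage.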

Now work with an image detail (the magic square). A rank 2 approximation already reproduces the boxes quite faithfully. A rank 50 approximation allows us to read all figures in the magic square and shows details of the wing, but fine detail (the diagonal hash marks in the square) is largely left out.

A = readJPEG("melencoliaDetail.JPG")
A.gray = .2989*A[,,1] + .5870*A[,,2] + .1140*A[,,3]
A.gray = t(A.gray[seq(dim(A.gray)[1],1,by = -1),])
imagegray(A.gray)

A.svd = svd(A.gray)
k = 2
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(mytrunc(A.approx))

k = 50
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(mytrunc(A.approx))

Completing a matrix with missing entries

Take a grayscale matrix and delete 50% of all entries. Now the figures are difficult to read.

A.inc = A.gray
pixels = dim(A.inc)[1]*dim(A.inc)[2]
deleteratio = .5
incomplete = sample(pixels, deleteratio*pixels, replace = F)
A.inc[incomplete] <- NA
imagegray(A.inc)

Use the function softImpute to fill in the missing entries. Here, we assume that the filled-in matrix has rank 100.

A.completion = softImpute(A.inc, rank.max = 100)
names(A.completion)
## [1] "u" "d" "v"
A.comp = A.completion$u%*%diag(A.completion$d)%*%t(A.completion$v)
imagegray(mytrunc(A.comp))

The figures are readable again, and even some of the diagonal stripes are visible.
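The idea behind softImpute can be sketched in base R as iterative low-rank SVD imputation (“hard impute”) on a small synthetic matrix; the real algorithm additionally soft-thresholds the singular values, but the refit-and-refill loop is the same:

```r
set.seed(1)
# Rank-3 "truth" with half of the entries deleted
M <- matrix(rnorm(60 * 3), 60, 3) %*% matrix(rnorm(3 * 40), 3, 40)
miss <- sample(length(M), 0.5 * length(M))
M.obs <- M
M.obs[miss] <- NA

X <- M.obs
X[is.na(X)] <- mean(X, na.rm = TRUE)  # initialize missing entries with the mean
k <- 3
for (i in 1:50) {
  s <- svd(X)
  X.hat <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # best rank-k fit
  X[miss] <- X.hat[miss]              # refill only the missing entries
}
sqrt(mean((X[miss] - M[miss])^2))     # RMSE on the held-out entries
```

On this noiseless low-rank example the held-out error shrinks toward zero; with real images the recovery is only approximate, as above.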

Now delete a systematic portion of the image. The algorithm is not capable of recovering the lost detail.

 A.inc.1 = A.gray
A.inc.1[450:500,50:100] <- NA
A.completion = softImpute(A.inc.1, rank.max = 200)
A.comp.1 = A.completion$u%*%diag(A.completion$d)%*%t(A.completion$v)
imagegray(mytrunc(A.comp.1))

Principal Component Regression

This topic appeared in Chapter 6, but it makes more sense to discuss it now.

head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## -Andy Allanson      293   66     1   30  29    14     1    293    66
## -Alan Ashby         315   81     7   24  38    39    14   3449   835
## -Alvin Davis        479  130    18   66  72    76     3   1624   457
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575
## -Andres Galarraga   321   87    10   39  42    30     2    396   101
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133
##                   CHmRun CRuns CRBI CWalks League Division PutOuts Assists
## -Andy Allanson         1    30   29     14      A        E     446      33
## -Alan Ashby           69   321  414    375      N        W     632      43
## -Alvin Davis          63   224  266    263      A        W     880      82
## -Andre Dawson        225   828  838    354      N        E     200      11
## -Andres Galarraga     12    48   46     33      N        E     805      40
## -Alfredo Griffin      19   501  336    194      A        W     282     421
##                   Errors Salary NewLeague
## -Andy Allanson        20     NA         A
## -Alan Ashby           10  475.0         N
## -Alvin Davis          14  480.0         A
## -Andre Dawson          3  500.0         N
## -Andres Galarraga      4   91.5         N
## -Alfredo Griffin      25  750.0         A

Let’s focus on just the first several numeric columns, and we’ll try to predict Salary from the rest, as we’ve done before.

hitters_df <- Hitters[,c(1:9,19)]
head(hitters_df)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## -Andy Allanson      293   66     1   30  29    14     1    293    66
## -Alan Ashby         315   81     7   24  38    39    14   3449   835
## -Alvin Davis        479  130    18   66  72    76     3   1624   457
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575
## -Andres Galarraga   321   87    10   39  42    30     2    396   101
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133
##                   Salary
## -Andy Allanson        NA
## -Alan Ashby        475.0
## -Alvin Davis       480.0
## -Andre Dawson      500.0
## -Andres Galarraga   91.5
## -Alfredo Griffin   750.0

Notice that many of these predictors are highly correlated. What is the problem with having correlated predictors?

cor(hitters_df[,1:9])
##             AtBat       Hits     HmRun        Runs       RBI     Walks
## AtBat  1.00000000 0.96793882 0.5921985 0.913060329 0.8205392 0.6698451
## Hits   0.96793882 1.00000000 0.5621579 0.922187191 0.8110732 0.6412106
## HmRun  0.59219847 0.56215789 1.0000000 0.650988186 0.8551222 0.4810143
## Runs   0.91306033 0.92218719 0.6509882 1.000000000 0.7982057 0.7322128
## RBI    0.82053923 0.81107323 0.8551222 0.798205666 1.0000000 0.6159971
## Walks  0.66984506 0.64121060 0.4810143 0.732212848 0.6159971 1.0000000
## Years  0.04737174 0.04476656 0.1163183 0.004541271 0.1461681 0.1364750
## CAtBat 0.23552636 0.22756487 0.2218821 0.186497388 0.2946884 0.2771748
## CHits  0.25271704 0.25581487 0.2206266 0.204829712 0.3082011 0.2806711
##              Years    CAtBat     CHits
## AtBat  0.047371739 0.2355264 0.2527170
## Hits   0.044766557 0.2275649 0.2558149
## HmRun  0.116318335 0.2218821 0.2206266
## Runs   0.004541271 0.1864974 0.2048297
## RBI    0.146168128 0.2946884 0.3082011
## Walks  0.136474971 0.2771748 0.2806711
## Years  1.000000000 0.9202894 0.9036311
## CAtBat 0.920289361 1.0000000 0.9950635
## CHits  0.903631057 0.9950635 1.0000000
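One way to quantify the problem: the condition number of the predictor correlation matrix blows up under near-collinearity, and the standard errors of the affected coefficients inflate. A simulated sketch (variable names are ours):

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + 0.05 * rnorm(n)   # nearly a copy of x1
x3 <- rnorm(n)               # independent predictor

# Condition number of the correlation matrix: large => near-collinearity
e <- eigen(cor(cbind(x1, x2, x3)))$values
max(e) / min(e)

# The standard error of x1's coefficient inflates once x2 enters the model
y <- x1 + x3 + rnorm(n)
se_without <- summary(lm(y ~ x1 + x3))$coefficients["x1", "Std. Error"]
se_with    <- summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", "Std. Error"]
c(se_without = se_without, se_with = se_with)
```

The estimates are still unbiased, but they become very unstable, which is exactly what replacing correlated predictors with orthogonal components avoids.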

We’ll use Principal Components Regression to first transform the predictors into a new set of orthogonal predictors. (Note that the call below uses the full Hitters data set, with all 19 predictors, rather than hitters_df.)

pcr.fit=pcr(Salary ~ ., data=Hitters ,scale=TRUE, validation ="CV")

summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    350.4    351.6    352.2    350.6    347.0    344.1
## adjCV          452    350.2    351.2    351.7    350.0    346.4    343.3
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       348.5    350.6    352.7     352.0     352.7     354.9     355.3
## adjCV    347.5    349.4    351.4     350.6     351.3     353.4     353.7
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        349.3     350.4     341.0     339.8     339.8     342.7
## adjCV     347.5     348.6     339.3     338.0     337.8     340.6
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         94.96    96.28     97.26     97.98     98.65     99.15     99.47
## Salary    46.75    46.86     47.76     47.82     47.85     48.10     50.40
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.75     99.89     99.97     99.99    100.00
## Salary     50.55     53.01     53.85     54.61     54.61
names(pcr.fit)
##  [1] "coefficients"  "scores"        "loadings"      "Yloadings"    
##  [5] "projection"    "Xmeans"        "Ymeans"        "fitted.values"
##  [9] "residuals"     "Xvar"          "Xtotvar"       "fit.time"     
## [13] "na.action"     "ncomp"         "method"        "scale"        
## [17] "validation"    "call"          "terms"         "model"
# cross-validation helps select the number of principal components to keep
plot(MSEP(pcr.fit))

# other helpful visualizations
plot(pcr.fit,ncomp=3)

plot(pcr.fit, "scores") # plot the data points in the first two PCs

plot(pcr.fit, "loadings")
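What pcr() does internally can be sketched with prcomp() and lm(): regress the response on the first m scores, then map the score coefficients back to the (standardized) original predictors. A minimal sketch on simulated data:

```r
set.seed(1)
n <- 100; p <- 6
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p) + rnorm(n))

m <- 3                                  # number of components to keep
pc <- prcomp(X, scale. = TRUE)
fit <- lm(y ~ pc$x[, 1:m])              # regress the response on the first m scores

# Map the m score coefficients back to the p standardized predictors
beta <- pc$rotation[, 1:m] %*% coef(fit)[-1]
drop(beta)
```

Because the scores are orthogonal, dropping trailing components does not change the coefficients of the retained ones; that is the sense in which PCR is a regularized regression.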

Exercise

Using the College data, we’ll apply Principal Component Regression to try to predict the variable “Accept”.

  1. Get a baseline for comparison - Use a linear model with all the features included and try to predict the “Accept” feature.

  2. Use ‘pcr()’ to predict the variable Accept. From the summary() output, approximately how many principal components are needed to explain 90% of the variance of ‘Accept’?

  3. Visualize the MSEP versus number of components included in the model.